Analysis of Document Structures for Element Type Classification

Authors

  • Helena Ahonen
  • Barbara Heikkinen
  • Oskari Heinonen
  • Jani Jaakkola
  • Mika Klemettinen
Abstract

As more and more digital documents become available for public use from different sources, the needs of users also increase. Seamless integration of heterogeneous collections, e.g., the ability to query and format documents in a uniform way, is one of these needs. Processing of documents is greatly enhanced if their structure is explicitly represented in some standard markup (SGML, XML, HTML). Hence, the problem of integrating heterogeneous structures has to be taken into consideration. We address this problem by introducing a classification method that acquires knowledge from document instances and their document type definitions (DTDs), and uses this knowledge to attach a generic class to each SGML element type. The classification retains the tree hierarchy of elements. Although the structure is simplified, enough distinctions remain to facilitate versatile further processing, e.g., formatting. The class of an element type can be stored in the document type definition and, using the architectural form feature of SGML, the documents can be processed as virtual documents obeying a predefined generic DTD. Specific uses of the classification, in addition to formatting and querying, include the assembly of new documents from existing document fragments and the automatic generation of style sheet templates for the original document type definitions. We have implemented the classification method and experimented with several document types.

Publication date: 1998